Module 07
UCI Machine Learning Repository
Number of Attributes: 5
1. Number of O-rings at risk on a given flight
2. Number experiencing thermal distress
3. Launch temperature (degrees F)
4. Leak-check pressure (psi)
5. Temporal order of flight
variables <- c("total_num", "oring_distress", "launch_temp", "leak_pressure", "flight_order")
oring_data <- "datasets/challenger+usa+space+shuttle+o+ring/o-ring-erosion-or-blowby.data"
shuttle <- read_table(oring_data, col_names = variables) |>
mutate(distress_binary = if_else(oring_distress > 0, 1, 0))

shuttle |>
ggplot(aes(x = launch_temp, y = distress_binary)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Challenger O-Ring Data",
subtitle = "Predicted probabilities lie outside [0, 1]",
x = "Launch Temperature (°F)",
y = "Probability of Distress") +
scale_y_continuous(breaks = seq(0, 1, by = 0.1), limits = c(0, 1)) +
scale_x_continuous(breaks = seq(50, 90, by = 5))

shuttle |>
ggplot(aes(x = launch_temp, y = distress_binary)) +
geom_point() +
stat_smooth(method="glm", method.args = list(family="binomial"), se = FALSE) +
labs(title = "Challenger O-Ring Data",
subtitle = "Predicted probabilities lie between 0 and 1",
x = "Launch Temperature (°F)",
y = "Probability of Distress") +
scale_y_continuous(breaks = seq(0, 1, by = 0.1), limits = c(0, 1)) +
scale_x_continuous(breaks = seq(50, 90, by = 5))

The logistic regression models the probability of O-ring distress given the launch temperature.
\[ \text{Pr}(\text{Distress} = \text{Yes} | \text{Temperature}) \]
We initially tried a linear regression model to represent the probability of distress.
\[ p(X) = \beta_0 + \beta_1 X \]
The linear model produced values outside the [0, 1] range.
The logistic regression model uses the logistic function to model the probability of distress.
\[ p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]
This function produces values between 0 and 1, and it is easy to plot in ggplot.
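As a quick sketch, the logistic function can be drawn directly with `stat_function()`. The coefficients below are round numbers close to the Challenger fit shown later in this module, used purely for illustration:

```r
library(ggplot2)

# The logistic function p(X) = 1 / (1 + exp(-(b0 + b1 * X)))
logistic <- function(x, b0, b1) 1 / (1 + exp(-(b0 + b1 * x)))

# Draw the curve over the observed temperature range
ggplot(data.frame(x = c(50, 85)), aes(x)) +
  stat_function(fun = logistic, args = list(b0 = 15.04, b1 = -0.232)) +
  labs(x = "Launch Temperature (°F)", y = "p(X)")
```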
Starting with the logistic function:
\[ p(X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]
After some algebra:
\[ \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \]
The quantity on the LHS is called the odds.
Taking the natural log of both sides:
\[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \]
The quantity on the LHS is called the log-odds or logit. The logistic regression model above has a logit that is linear in \(X\).
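Because the logit is linear in \(X\), \(e^{\beta_1}\) is the multiplicative change in the odds for a one-unit increase in \(X\). A small check using the Challenger slope estimate of \(-0.2322\) from the model summary in this module:

```r
# Each additional degree F multiplies the odds of distress by exp(beta1)
beta1 <- -0.2322
exp(beta1)  # about 0.79: the odds fall roughly 21% per degree of warming
```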
The logistic regression model is estimated using the maximum likelihood method.
The likelihood function is:
\[ \mathcal{L}(\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i) \prod_{i':y_{i'}=0} (1 - p(x_{i'})) \]
Taking the natural logarithm, we get the log-likelihood function:
\[ \ell(\beta_0, \beta_1) = \sum_{i:y_i=1} \log(p(x_i)) + \sum_{i':y_{i'}=0} \log(1 - p(x_{i'})) \]
The goal is to find \(\beta_0\) and \(\beta_1\) that maximize \(\ell(\beta_0, \beta_1)\). This is typically done by:
1. Taking the partial derivatives of \(\ell(\beta_0, \beta_1)\) with respect to \(\beta_0\) and \(\beta_1\).
2. Setting these derivatives to zero and solving the resulting equations.
3. Since these equations are nonlinear in the parameters, using iterative methods (e.g., Fisher scoring or gradient descent) to find the solution.
The solution \((\hat{\beta_0}, \hat{\beta_1})\) that maximizes \(\ell(\beta_0, \beta_1)\) is the maximum likelihood estimate of the parameters.
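A minimal sketch of this procedure: minimize the negative log-likelihood numerically with `optim()`. The small dataset below is made up for illustration (it is not the shuttle data):

```r
# Illustrative data: binary outcome y vs. a single predictor x
x <- c(53, 57, 63, 66, 70, 75, 78, 81)
y <- c(1, 1, 1, 0, 1, 0, 0, 0)

# Negative log-likelihood of the logistic regression model
neg_loglik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Minimizing the negative log-likelihood = maximizing the log-likelihood
fit <- optim(c(0, 0), neg_loglik)
fit$par  # (beta0_hat, beta1_hat); approximately matches glm(y ~ x, family = binomial)
```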
Call:
glm(formula = distress_binary ~ launch_temp, family = "binomial",
data = shuttle)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 15.0429 7.3786 2.039 0.0415 *
launch_temp -0.2322 0.1082 -2.145 0.0320 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 28.267 on 22 degrees of freedom
Residual deviance: 20.315 on 21 degrees of freedom
AIC: 24.315
Number of Fisher Scoring iterations: 5
shuttle |>
ggplot(aes(x = launch_temp, y = distress_binary)) +
geom_point() +
stat_smooth(method="glm", method.args = list(family="binomial"), se = FALSE) +
labs(title = "Challenger O-Ring Data",
subtitle = "Predicted probabilities lie between 0 and 1",
x = "Launch Temperature (°F)",
y = "Probability of Distress") +
scale_y_continuous(breaks = seq(0, 1, by = 0.1), limits = c(0, 1)) +
scale_x_continuous(breaks = seq(50, 90, by = 5))

Similar to the linear regression model, we can include multiple predictors in the logistic regression model.
\[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p \]
Expressed in terms of \(p(X)\):
\[ p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p}} \]
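Given estimated coefficients, the predicted probability is just the logistic function applied to the linear predictor; base R's `plogis()` computes exactly this. The coefficients and observation below are made up for illustration:

```r
# plogis(z) = 1 / (1 + exp(-z)), the logistic CDF
beta  <- c(-1.0, 0.5, 2.0)  # (intercept, beta1, beta2) -- illustrative values
x_new <- c(1, 0.8, 0.3)     # leading 1 multiplies the intercept

p_hat <- plogis(sum(beta * x_new))
p_hat  # linear predictor is 0 here, so p_hat = 0.5
```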
UCI Machine Learning Repository
Call:
glm(formula = quality_binary ~ ., family = "binomial", data = train_wine_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.383e+02 1.324e+02 1.044 0.296414
`fixed acidity` 1.992e-01 1.516e-01 1.314 0.188808
`volatile acidity` -3.099e+00 9.724e-01 -3.187 0.001437 **
`citric acid` 2.592e-01 1.030e+00 0.252 0.801333
`residual sugar` 1.970e-01 9.038e-02 2.179 0.029300 *
chlorides -1.134e+01 5.162e+00 -2.198 0.027980 *
`free sulfur dioxide` 2.786e-02 1.613e-02 1.727 0.084093 .
`total sulfur dioxide` -2.754e-02 7.131e-03 -3.862 0.000112 ***
density -1.502e+02 1.353e+02 -1.110 0.266958
pH -1.397e-01 1.222e+00 -0.114 0.908994
sulphates 3.370e+00 6.972e-01 4.833 1.34e-06 ***
alcohol 7.706e-01 1.581e-01 4.874 1.09e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 874.3 on 1118 degrees of freedom
Residual deviance: 603.1 on 1107 degrees of freedom
AIC: 627.1
Number of Fisher Scoring iterations: 6
A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
| | True: 0 | True: 1 |
|---|---|---|
| Predicted: 0 | TN | FN |
| Predicted: 1 | FP | TP |
Accuracy = \(\frac{TP + TN}{TP + TN + FP + FN}\)
Precision = \(\frac{TP}{TP + FP}\)
Recall = \(\frac{TP}{TP + FN}\)
F1 Score = \(2 \times \frac{Precision \times Recall}{Precision + Recall}\)
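These four metrics can be computed from any 2x2 confusion matrix. The helper below is a small base-R sketch; the layout (rows = Predicted 0/1, columns = Actual 0/1) and the toy counts are assumptions for illustration:

```r
# Metrics from a 2x2 confusion matrix: rows = Predicted (0, 1), cols = Actual (0, 1)
classification_metrics <- function(cm) {
  TN <- cm[1, 1]; FN <- cm[1, 2]
  FP <- cm[2, 1]; TP <- cm[2, 2]
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  c(accuracy  = (TP + TN) / sum(cm),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

cm <- matrix(c(50, 10, 5, 35), nrow = 2, byrow = TRUE)  # toy counts
classification_metrics(cm)  # accuracy 0.85, precision 0.875
```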
| Name | Definition | Synonyms |
|---|---|---|
| False Positive Rate (FPR) | \(\frac{FP}{FP + TN}\) | Type I Error Rate, 1 - Specificity |
| True Positive Rate (TPR) | \(\frac{TP}{TP + FN}\) | 1 - Type II error, Power, Sensitivity, Recall |
| Positive Predictive Value (PPV) | \(\frac{TP}{TP + FP}\) | Precision, 1 - false discovery proportion |
| Negative Predictive Value (NPV) | \(\frac{TN}{TN + FN}\) | 1 - false omission proportion |
predictions_wine <- predict(logit_model_wine, newdata = test_wine_data, type = "response")
predicted_classes_wine <- ifelse(predictions_wine > 0.5, 1, 0)
conf_matrix_wine <- table(Predicted = predicted_classes_wine, Actual = test_wine_data$quality_binary)
accuracy_wine <- sum(diag(conf_matrix_wine)) / sum(conf_matrix_wine)
precision_wine <- conf_matrix_wine[2,2] / sum(conf_matrix_wine[2,])
recall_wine <- conf_matrix_wine[2,2] / sum(conf_matrix_wine[,2])
f1_score_wine <- 2 * (precision_wine * recall_wine) / (precision_wine + recall_wine)
print(paste("Accuracy:", round(accuracy_wine, 4)))

[1] "Accuracy: 0.8833"
[1] "Precision: 0.697"
[1] "Recall: 0.3333"
[1] "F1 Score: 0.451"
predicted_probs <- predict(logit_model_wine, newdata = test_wine_data, type = "response")
roc_obj <- roc(test_wine_data$quality_binary, predicted_probs, quiet = TRUE)
optimal_cutoff <- coords(roc_obj, "best", best.method = "closest.topleft")
roc_df <- data.frame(
FPR = 1 - roc_obj$specificities,
TPR = roc_obj$sensitivities
)
ggplot(roc_df, aes(x = FPR, y = TPR)) +
geom_line() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
labs(
title = "ROC Curve for Wine Quality Data",
subtitle = "Multiple Logistic Regression Model",
x = "False Positive Rate",
y = "True Positive Rate"
) +
theme_light() +
annotate("text", x = 0.8, y = 0.2,
label = paste("AUC =", round(auc(roc_obj), 3))) +
annotate("text", x = 0.8, y = 0.1,
label = paste("Optimal cutoff =", round(optimal_cutoff$threshold, 3)))

The receiver operating characteristic (ROC) curve is a plot of the True Positive Rate (TPR) against the False Positive Rate (FPR) for different cutoff (threshold) values. The true positive rate is also known as the sensitivity; the false positive rate equals 1 - specificity.
\[ \text{True Positive Rate} = \frac{TP}{TP + FN} \]
\[ \text{False Positive Rate} = \frac{FP}{FP + TN} \]
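Each cutoff produces one (FPR, TPR) point, and sweeping the cutoff traces out the whole curve. A hand-rolled sketch on made-up probabilities and labels (`pROC` does this bookkeeping for you):

```r
# Illustrative predicted probabilities and true labels
probs  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual <- c(1,   1,   0,   1,   0,   1,   0,   0)

# One (FPR, TPR) point for a given cutoff
roc_point <- function(cutoff) {
  pred <- as.integer(probs > cutoff)
  c(FPR = sum(pred == 1 & actual == 0) / sum(actual == 0),
    TPR = sum(pred == 1 & actual == 1) / sum(actual == 1))
}

# Sweep the cutoff from 0 to 1 to trace the curve
t(sapply(seq(0, 1, by = 0.25), roc_point))
```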
predicted_classes_wine_optimal <- ifelse(predictions_wine > optimal_cutoff$threshold, 1, 0)
conf_matrix_wine_optimal <- table(Predicted = predicted_classes_wine_optimal, Actual = test_wine_data$quality_binary)
accuracy_wine_optimal <- sum(diag(conf_matrix_wine_optimal)) / sum(conf_matrix_wine_optimal)
precision_wine_optimal <- conf_matrix_wine_optimal[2,2] / sum(conf_matrix_wine_optimal[2,])
recall_wine_optimal <- conf_matrix_wine_optimal[2,2] / sum(conf_matrix_wine_optimal[,2])
f1_score_wine_optimal <- 2 * (precision_wine_optimal * recall_wine_optimal) / (precision_wine_optimal + recall_wine_optimal)
print(paste("Accuracy (Optimal Cutoff):", round(accuracy_wine_optimal, 4)))

[1] "Accuracy (Optimal Cutoff): 0.8"
[1] "Precision (Optimal Cutoff): 0.4029"
[1] "Recall (Optimal Cutoff): 0.8116"
[1] "F1 Score (Optimal Cutoff): 0.5385"
K-nearest neighbors (KNN) is a non-parametric method that classifies an observation based on the majority class of its \(k\) nearest neighbors.
We can apply the KNN algorithm to the wine quality data.
train_wine_data_knn <- train_wine_data |> select(-quality_binary)
test_wine_data_knn <- test_wine_data |> select(-quality_binary)
knn_pred <- knn(train = train_wine_data_knn,
test = test_wine_data_knn,
cl = train_wine_data$quality_binary,
k = 5)
conf_matrix_knn <- table(Predicted = knn_pred, Actual = test_wine_data$quality_binary)
accuracy_knn <- sum(diag(conf_matrix_knn)) / sum(conf_matrix_knn)
precision_knn <- conf_matrix_knn[2,2] / sum(conf_matrix_knn[2,])
recall_knn <- conf_matrix_knn[2,2] / sum(conf_matrix_knn[,2])
f1_score_knn <- 2 * (precision_knn * recall_knn) / (precision_knn + recall_knn)
print(paste("Accuracy (KNN):", round(accuracy_knn, 4)))

[1] "Accuracy (KNN): 0.8521"
[1] "Precision (KNN): 0.475"
[1] "Recall (KNN): 0.2754"
[1] "F1 Score (KNN): 0.3486"
We “loop” over different values of \(k\) (here with purrr::map()) and evaluate the model's accuracy and precision.
train_wine_data_knn <- train_wine_data |> select(-quality_binary)
test_wine_data_knn <- test_wine_data |> select(-quality_binary)
k_values <- seq(1, 19, 2)
knn_metrics <- map(k_values, function(k) {
knn_pred <- knn(train = train_wine_data_knn,
test = test_wine_data_knn,
cl = train_wine_data$quality_binary,
k = k)
conf_matrix_knn <- table(Predicted = knn_pred, Actual = test_wine_data$quality_binary)
accuracy_knn <- sum(diag(conf_matrix_knn)) / sum(conf_matrix_knn)
precision_knn <- conf_matrix_knn[2,2] / sum(conf_matrix_knn[2,])
recall_knn <- conf_matrix_knn[2,2] / sum(conf_matrix_knn[,2])
f1_score_knn <- 2 * (precision_knn * recall_knn) / (precision_knn + recall_knn)
tibble(
k = k,
Accuracy = accuracy_knn,
Precision = precision_knn,
Recall = recall_knn,
F1_Score = f1_score_knn)
}
) |> list_rbind()
# print results
knn_metrics

\[
\text{Precision} = \frac{TP}{TP + FP} \quad \quad \quad
\text{Recall} = \frac{TP}{TP + FN} \quad \quad \quad \quad
\text{F1 Score} = 2 \times \frac{Precision \times Recall}{Precision + Recall}
\]
k = 1 has the highest recall and F1, which are probably the most important metrics in this context. (We should always be aware of the potential for overfitting the data.)
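KNN uses Euclidean distance, so predictors measured on larger scales dominate the distance calculation. A common refinement, sketched below assuming the train/test objects defined earlier are available, is to standardize both sets using the training means and standard deviations:

```r
library(class)  # knn()

# Standardize with the *training* means and standard deviations,
# so the test set is scaled consistently with the training set.
centers <- sapply(train_wine_data_knn, mean)
spreads <- sapply(train_wine_data_knn, sd)

train_scaled <- scale(train_wine_data_knn, center = centers, scale = spreads)
test_scaled  <- scale(test_wine_data_knn,  center = centers, scale = spreads)

knn_pred_scaled <- knn(train = train_scaled, test = test_scaled,
                       cl = train_wine_data$quality_binary, k = 5)
```

Standardizing often changes which \(k\) performs best, so the sweep over \(k\) is worth repeating on the scaled data.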
(Included here as it is another example where linear regression is not appropriate.)
Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles. Today there is great interest in these systems due to their important role in traffic, environmental, and health issues.
Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data they generate make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded. This turns a bike-sharing system into a virtual sensor network that can be used for sensing mobility in the city; hence, it is expected that most important events in the city could be detected by monitoring these data.
bike_share_hour |>
filter(yr == 0) |>
ggplot(aes(x = hr, y = cnt)) +
geom_point(color = "skyblue", position = position_jitter(width = 0.1), alpha = 1/3) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
# geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) +
# geom_smooth(method = "loess", formula = y ~ x, se = FALSE, span = 0.7) +
labs(
title = "Bike Sharing Data (2011)",
x = "Hour of the Day",
y = "Count of Bikes Rented"
) +
theme_light()
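The fit below uses glm() with family = "poisson", i.e., Poisson regression with a log link: the count is modeled as

\[ Y \sim \text{Poisson}(\lambda(X)), \qquad \log(\lambda(X)) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]

so the expected count \(\lambda(X) = e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}\) is guaranteed to be positive, unlike the fitted values from a linear model. In the model below, hr enters as a factor, so each hour gets its own dummy coefficient relative to hour 0.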
Call:
glm(formula = cnt ~ hr, family = "poisson", data = bike_share_hour_poisson)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.762295 0.008022 469.01 <2e-16 ***
hr1 -0.483265 0.012999 -37.18 <2e-16 ***
hr2 -0.821901 0.014645 -56.12 <2e-16 ***
hr3 -1.453588 0.018840 -77.15 <2e-16 ***
hr4 -2.077436 0.024793 -83.79 <2e-16 ***
hr5 -1.080652 0.016071 -67.24 <2e-16 ***
hr6 0.291584 0.010603 27.50 <2e-16 ***
hr7 1.292880 0.009051 142.85 <2e-16 ***
hr8 1.809838 0.008650 209.23 <2e-16 ***
hr9 1.336799 0.009009 148.39 <2e-16 ***
hr10 1.112019 0.009241 120.33 <2e-16 ***
hr11 1.287031 0.009056 142.11 <2e-16 ***
hr12 1.485279 0.008877 167.32 <2e-16 ***
hr13 1.487314 0.008875 167.58 <2e-16 ***
hr14 1.445238 0.008910 162.20 <2e-16 ***
hr15 1.476453 0.008884 166.19 <2e-16 ***
hr16 1.695506 0.008719 194.45 <2e-16 ***
hr17 2.094714 0.008496 246.55 <2e-16 ***
hr18 2.013104 0.008538 235.78 <2e-16 ***
hr19 1.703100 0.008718 195.35 <2e-16 ***
hr20 1.391299 0.008959 155.29 <2e-16 ***
hr21 1.140324 0.009209 123.82 <2e-16 ***
hr22 0.880108 0.009534 92.31 <2e-16 ***
hr23 0.474563 0.010206 46.50 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1052921 on 8644 degrees of freedom
Residual deviance: 413848 on 8621 degrees of freedom
AIC: 466934
Number of Fisher Scoring iterations: 5
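Because the link is the log, coefficients act multiplicatively on the expected count: exponentiating the intercept gives the fitted mean count at the baseline hour 0, and adding an hour dummy before exponentiating gives that hour's fitted mean. A quick check using the estimates above:

```r
# Fitted mean count at an hour = exp(intercept + that hour's coefficient)
b0 <- 3.762295  # intercept (hour 0 baseline)
b8 <- 1.809838  # hr8 coefficient from the summary above

exp(b0)       # expected rentals at hour 0, about 43
exp(b0 + b8)  # expected rentals at hour 8 (morning commute), about 263
```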
poisson_coefs <- tidy(poisson_model) |>
filter(term != "(Intercept)")
ggplot(poisson_coefs, aes(x = seq_along(estimate), y = estimate)) +
geom_point(color = "blue") +
geom_line(color = "blue") +
labs(
title = "Poisson Regression Coefficients",
x = "Hour of the Day",
y = "Coefficient"
) +
theme_light()

# mean of predicted values for each hour
predicted_values <- predict(poisson_model, newdata = bike_share_hour_poisson, type = "response")
predicted_values_mean <- bike_share_hour_poisson |>
mutate(predicted_values = predicted_values) |>
group_by(hr) |>
summarize(mean_predicted_values = mean(predicted_values)) |>
ungroup()
bike_share_hour |>
ggplot() +
geom_point(aes(x = hr, y = cnt), color = "skyblue", position = position_jitter(width = 0.1), alpha = 1/3) +
geom_line(aes(x = as.numeric(hr) - 1, y = mean_predicted_values), data = predicted_values_mean) +
labs(
title = "Mean of Predicted Values (Poisson Regression)",
x = "Hour of the Day",
y = "Mean of Predicted Values"
) +
theme_light()

UCI Machine Learning Repository
From the Audubon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible. (“Mushroom” 1987)
mushroom_file <- "datasets/mushroom/agaricus-lepiota.data"
mushroom_names <- c("class", "cap_shape", "cap_surface", "cap_color", "bruises", "odor", "gill_attachment", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root", "stalk_surface_above_ring", "stalk_surface_below_ring", "stalk_color_above_ring", "stalk_color_below_ring", "veil_type", "veil_color", "ring_number", "ring_type", "spore_print_color", "population", "habitat")
mushroom_df <- read_csv(mushroom_file,
na = "?",
col_names = mushroom_names,
col_types = cols(.default = col_factor())) |>
mutate(poisonous = as_factor(if_else(class == "p", 1, 0))) |>
relocate(poisonous, .after = class) |>
select(-class, -veil_type, -cap_color, -odor, -gill_color, -stalk_color_above_ring, -stalk_color_below_ring,
-veil_color, -ring_type)
# mushroom_df |>
# group_by(poisonous, odor) |>
# count() |>
# ggplot() +
# geom_col(aes(odor, n, fill = poisonous), position = "dodge")
mushroom_df

# # Load necessary libraries
# library(caret)
# library(pROC)
#
# # Assuming your dataframe is called 'mushroom_df'
# # and the target variable is called 'poisonous' (1 for poisonous, 0 for edible)
#
# # Split the data into training and testing sets
# set.seed(42) # for reproducibility
# train_index <- createDataPartition(mushroom_df$poisonous, p = 0.7, list = FALSE)
# train_data <- mushroom_df[train_index, ]
# test_data <- mushroom_df[-train_index, ]
#
# # Fit the logistic regression model
# model <- glm(poisonous ~ ., data = train_data, family = binomial)
#
# # Summary of the model
# summary(model)
#
# # Make predictions on the test set
# predictions <- predict(model, newdata = test_data, type = "response")
#
# # Convert probabilities to binary predictions
# predicted_classes <- ifelse(predictions > 0.5, 1, 0)
#
# # Evaluate the model
# conf_matrix <- table(Predicted = predicted_classes, Actual = test_data$poisonous)
# accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
# precision <- conf_matrix[2,2] / sum(conf_matrix[2,])
# recall <- conf_matrix[2,2] / sum(conf_matrix[,2])
# f1_score <- 2 * (precision * recall) / (precision + recall)
#
# # Print evaluation metrics
# print(paste("Accuracy:", accuracy))
# print(paste("Precision:", precision))
# print(paste("Recall:", recall))
# print(paste("F1 Score:", f1_score))
#
# # Plot ROC curve
# roc_obj <- roc(test_data$poisonous, predictions)
# plot(roc_obj, main = "ROC Curve")
# auc_value <- auc(roc_obj)
# print(paste("AUC:", auc_value))
#
# # Feature importance (based on absolute value of coefficients)
# feature_importance <- abs(coef(model))[-1] # Exclude intercept
# feature_importance <- sort(feature_importance, decreasing = TRUE)
# print("Feature Importance:")
# print(feature_importance)

Applied Statistical Techniques